Skip to content

feat: implement DuckDB filesystem integration for Vortex file handling#6198

Merged
0ax1 merged 5 commits into
vortex-data:developfrom
Lychee-Technology:develop
Feb 16, 2026
Merged

feat: implement DuckDB filesystem integration for Vortex file handling#6198
0ax1 merged 5 commits into
vortex-data:developfrom
Lychee-Technology:develop

Conversation

@iceboundrock

@iceboundrock iceboundrock commented Jan 29, 2026

Copy link
Copy Markdown
Contributor

Comment thread vortex-duckdb/src/scan.rs Outdated
@joseph-isaacs joseph-isaacs added the action/benchmark Trigger full benchmarks to run on this PR label Jan 29, 2026
@codspeed-hq

codspeed-hq Bot commented Jan 29, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 1057 untouched benchmarks
⏩ 1346 skipped benchmarks1


Comparing Lychee-Technology:develop (8354deb) with develop (3029e9b)

Open in CodSpeed

Footnotes

  1. 1346 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@0ax1 0ax1 self-requested a review January 29, 2026 10:40
@0ax1 0ax1 added action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/feature A new feature and removed action/benchmark Trigger full benchmarks to run on this PR labels Jan 29, 2026
@joseph-isaacs joseph-isaacs added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Jan 29, 2026

@0ax1 0ax1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to look into this. There's a couple of issues around thread safety / locking, error handling and memory leaks.

If you used AI to assist in writing the code for your PR please mention this in your PR description (see: https://github.com/vortex-data/vortex/blob/develop/CONTRIBUTING.md).

Comment thread vortex-duckdb/src/copy.rs Outdated
Comment thread vortex-duckdb/cpp/file_system.cpp Outdated
Comment thread vortex-duckdb/cpp/file_system.cpp Outdated
Comment thread vortex-duckdb/cpp/file_system.cpp Outdated
Comment thread vortex-duckdb/cpp/file_system.cpp Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/copy.rs Outdated
@joseph-isaacs joseph-isaacs added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Feb 2, 2026
Comment thread vortex-duckdb/src/copy.rs Outdated
Comment thread vortex-duckdb/src/utils/glob.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/cpp/error.cpp Outdated
Comment thread vortex-duckdb/src/duckdb/client_context.rs
Comment thread vortex-duckdb/src/scan.rs Outdated

@0ax1 0ax1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, thanks for taking the time to address the comments!

Comment thread vortex-duckdb/src/scan.rs Outdated
Comment thread vortex-duckdb/cpp/include/duckdb_vx/file_system.h Outdated
@ruoshili ruoshili force-pushed the develop branch 2 times, most recently from 66fedc0 to b6e15a3 Compare February 11, 2026 02:21
@iceboundrock

Copy link
Copy Markdown
Contributor Author

Almost there, thanks for taking the time to address the comments!

Hi @0ax1 , any other comments?

@0ax1

0ax1 commented Feb 11, 2026

Copy link
Copy Markdown
Contributor

Almost there, thanks for taking the time to address the comments!

Hi @0ax1 , any other comments?

Currently swamped with other work but will get back to it when I find a free moment.

Comment thread vortex-duckdb/src/scan.rs Outdated
@gatesn

gatesn commented Feb 13, 2026

Copy link
Copy Markdown
Contributor

@iceboundrock - quick question on this. Do you know if it's possible to grab the config back out from DuckDB's filesystems?

I don't want to block this PR, don't worry(!), but we do have plans to implement a high performance I/O layer at some point and it would probably make more sense to try and configure our own I/O using duckdb configs, rather than attempt to optimize calling back into DuckDB's I/O layer.

@gatesn gatesn added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Feb 13, 2026
@iceboundrock

iceboundrock commented Feb 13, 2026

Copy link
Copy Markdown
Contributor Author

@iceboundrock - quick question on this. Do you know if it's possible to grab the config back out from DuckDB's filesystems?

I don't want to block this PR, don't worry(!), but we do have plans to implement a high performance I/O layer at some point and it would probably make more sense to try and configure our own I/O using duckdb configs, rather than attempt to optimize calling back into DuckDB's I/O layer.

Could you please elaborate on what config refers to? If it’s S3 credentials, HTTP Proxy configuration, etc., I don’t believe we can read them from DuckDB. They are stored in Secrets Manager and the read API redacts sensitive information.

We probably can do the same as the httpfs extension does. Code link:

@0ax1 0ax1 added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Feb 16, 2026
@0ax1

0ax1 commented Feb 16, 2026

Copy link
Copy Markdown
Contributor

@iceboundrock - quick question on this. Do you know if it's possible to grab the config back out from DuckDB's filesystems?
I don't want to block this PR, don't worry(!), but we do have plans to implement a high performance I/O layer at some point and it would probably make more sense to try and configure our own I/O using duckdb configs, rather than attempt to optimize calling back into DuckDB's I/O layer.

Could you please elaborate on what config refers to? If it’s S3 credentials, HTTP Proxy configuration, etc., I don’t believe we can read them from DuckDB. They are stored in Secrets Manager and the read API redacts sensitive information.

We probably can do the same as the httpfs extension does. Code link:

* [1](https://github.com/duckdb/duckdb-httpfs/blob/6afe39d5be688b032146baa97e49601b50c9456e/src/httpfs_extension.cpp#L73-L105)

* [2](https://github.com/duckdb/duckdb-httpfs/blob/main/src/create_secret_functions.cpp)

Let's handle this in a follow up to limit the scope of this PR.

@0ax1 0ax1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks great and CI (including all benchmarks) pass.

Last bit of fine-tuning and this is ready to land.

Comment thread vortex-duckdb/cpp/file_system.cpp Outdated
Comment thread vortex-duckdb/src/duckdb/client_context.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
Comment thread vortex-duckdb/src/duckdb/file_system.rs Outdated
iceboundrock and others added 3 commits February 16, 2026 08:28
…ated function signatures

Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
…ety and update write/flush methods to use spawn_blocking

Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
@0ax1 0ax1 enabled auto-merge (squash) February 16, 2026 18:17

@0ax1 0ax1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution! @iceboundrock 🎉

@0ax1 0ax1 merged commit 483191d into vortex-data:develop Feb 16, 2026
50 checks passed
@gatesn

gatesn commented Feb 17, 2026

Copy link
Copy Markdown
Contributor

Looks like this is materially worse for S3 performance:

Screenshot 2026-02-17 at 09 19 44 https://bench.vortex.dev/

[edit] looks like the coalescing configs are different from what ObjectStore uses. I think we just use the same values for now until we can properly investigate how best to tune these things.

@0ax1

0ax1 commented Feb 17, 2026

Copy link
Copy Markdown
Contributor

Looks like this is materially worse for S3 performance: https://bench.vortex.dev/
[edit] looks like the coalescing configs are different from what ObjectStore uses. I think we just use the same values for now until we can properly investigate how best to tune these things.

Will follow up on that!

@iceboundrock

iceboundrock commented Feb 18, 2026

Copy link
Copy Markdown
Contributor Author

Looks like this is materially worse for S3 performance:

Screenshot 2026-02-17 at 09 19 44 https://bench.vortex.dev/
[edit] looks like the coalescing configs are different from what ObjectStore uses. I think we just use the same values for now until we can properly investigate how best to tune these things.

Sorry, would you mind letting me know what are coalescing configs?

And I think it may be because the new version needs to install & load httpfs extension first. If you agree, I can make a PR to add the commands to the benchmark script.

@gatesn

gatesn commented Feb 18, 2026

Copy link
Copy Markdown
Contributor

Oh that's an interesting point. Definitely worth us pre-loading the extension, although it ships with core DuckDB so there wouldn't be any download.

Coalescing configs determine how to combine small individual requests to S3 into fewer larger requests to avoid the overhead of each request. We have made a PR to match them up to the old ones for one.

The odd thing for me is that the benchmarks didn't regress deterministically, they bounce up and down.

@0ax1

0ax1 commented Feb 18, 2026

Copy link
Copy Markdown
Contributor

Sorry, would you mind letting me know what are coalescing configs?

And I think it may be because the new version needs to install & load httpfs extension first. If you agree, I can make a PR to add the commands to the benchmark script.

The install and load was addressed here: #6565. And the coalescing was fixed here: #6555. Must be sth else in the mix why the S3 perf regressed.

@iceboundrock

Copy link
Copy Markdown
Contributor Author

Oh that's an interesting point. Definitely worth us pre-loading the extension, although it ships with core DuckDB so there wouldn't be any download.

Coalescing configs determine how to combine small individual requests to S3 into fewer larger requests to avoid the overhead of each request. We have made a PR to match them up to the old ones for one.

The odd thing for me is that the benchmarks didn't regress deterministically, they bounce up and down.

Thanks @gatesn .

Based on my experience, even DuckDB doesn’t download the extensions, it still adds 2~4 seconds to my AWS Lambda (ARM 64-bit and 512MB memory) cold start time, just to install and load the parquet and httpfs extensions from my Lambda package. To address this issue, I created a project to build the static library of DuckDB with the parquet and httpfs extensions bundled in.

fastio pushed a commit to fastio/vortex that referenced this pull request Mar 10, 2026
vortex-data#6198)

* This PR addresses vortex-data#5767

---------

Signed-off-by: Ruoshi Li <iceboundrock@gmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@myrrc myrrc mentioned this pull request Jun 19, 2026
myrrc added a commit that referenced this pull request Jun 19, 2026
DuckdbFS implementation for Vortex was introduced in
#6198 as opt-out, but changed
to opt-in in #6564 due to
performance regressions.
There were multiple issues
(#6709,
#6565
#6685) associated with it
which differ from vortex's file system behaviour.

It also requires additional dependencies CI which are a blocker for 
#8486 since MacOS runner
doesn't bundle openssl for x86_64 on arm, and builds fail.

As a long term goal, calling duckdb's blocking IO inside our event loop
isn't the right abstraction. We want to allow duckdb to use its own IO
outside vortex.

Duckdb fs is  also not maintaned actively so we're removing it

Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants